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Abstract 

Background: We study the statistical properties of fragment coverage in genome sequencing 
experiments. In an extension of the classic Lander- Waterman model, we consider the effect of 
the length distribution of fragments. We also introduce the notion of the shape of a coverage 
function, which can be used to detect abberations in coverage. The probability theory under- 
lying these problems is essential for constructing models of current high-throughput sequencing 
experiments, where both sample preparation protocols and sequencing technology particulars 
can affect fragment length distributions. 

Results: We show that regardless of fragment length distribution and under the mild assump- 
tion that fragment start sites are Poisson distributed, the fragments produced in a sequencing 
experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We 
then study the jump skeleton of the the coverage function, and show that the induced trees are 
Galton- Watson trees whose parameters can be computed. 

Conclusions: Our results extend standard analyses of shotgun sequencing that focus on cover- 
age statistics at individual sites, and provide a null model for detecting deviations from random 
coverage in high-throughput sequence census based experiments. By focusing on fragments, we 
are also led to a new approach for visualizing sequencing data that should be of independent 
interest. 

1 Introduction 

The classic "Lander- Waterman model" [U] provides statistical estimates for the read coverage 
in a whole genome shotgun (WGS) sequencing experiment via the Poisson approximation to the 
Binomial distribution. Although originally intended for estimating the extent of coverage when 
mapping by fingerprinting random clones, the Lander- Waterman model has served as an essential 
tool for estimating sequencing requirements for modern WGS experiments [T7]. Although it makes 
a number of simplifying assumptions (e.g. fixed fragment length and uniform fragment selection ) 
that are violated in actual experiments, extensions and generalizations [19i jl8j have continued to 
be developed and applied in a variety of settings. 

The advent of "high-throughput sequencing", which refers to massively parallel sequencing 
technologies has greatly increased the scope and applicability of sequencing experiments. With the 
increasing scope of experiments, new statistical questions about coverage statistics have emerged. 
In particular, in the context of sequence census methods, it has become important to understand 
the shape of coverage functions, rather than just coverage statistics at individual sites. 
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Sequence census methods [20] are experiments designed to assess the content of a mixture of 
molecules via the creation of DNA fragments whose abundances can be used to infer those of the 
original molecules. The DNA fragments are identified by sequencing, and the desired abundances 
inferred by solution of an inverse problem. An example of a sequence census method is ChlP-Seq. 
In this experiment, the goal is to determine the locations in the genome where a specific protein 
binds. An antibody to the protein is used to "pull down" fragments of DNA that are bound via 
a process called chromatin immunoprecipitation (abbreviated by ChIP). These fragments form the 
"mixture of molecules" and after purifying the DNA, the fragments are determined by sequencing. 
The resulting sequences are compared to the genome, leading to a coverage function that records, 
at each site, the number of sequenced fragments that contained it. As with many sequence census 
methods, "noise" in the experiment leads to random sequenced fragments that may not correspond 
to bound DNA, and therefore it is necessary to identify regions of the coverage function that deviate 
from what is expected according to a suitable null model. 

The purpose of this paper is not to develop methods for the analysis of ChlP-Seq (or any other 
sequence census method), but rather to present a null model for the shape of a coverage function 
that is of general utility. That is, we propose a definition for the shape of a fragment coverage 
function, and describe a random instance assuming that fragments are selected at random from a 
genome, with lengths of fragments given by a known distribution. The distinction between our work 
and previous statistical studies of sequencing experiments, is that we go beyond the description of 
coverage at a single location, to a description of the change in coverage along a genome. 

2 The shape of a fragment coverage function 

We begin by explaining what we mean by a coverage function. Given a genome modeled as a string 
of fixed length a coverage function is a function / : {1, . . . , A^} — > Z>o. The interpretation 
of this function, is that f{i) is the number of sequenced fragments obtained from a sequencing 
experiment that cover position i in the genome. It is important to note that is typically large; 
for example, the human genome consists of approximately 2.8 billion bases. Because A'^ is very 
large, we replace the finite set {1, . . . , A^} with M, and re-define a coverage function to be a function 
/ : R — > Z>o- This helps to simplify our analysis. 

We next introduce an object that describes a sequence coverage function's shape. Our approach 
is motivated by recent applications of topology including persistent homology [21 [21] and the use of 
critical points in shape analysis [llElE]. For a given coverage function / : M — > Z>0) we will define 
a rooted tree, which is a particular type of directed graph with all the directed edges pointing away 
from the root. This tree Tf is based on the upper- excursion sets of f: Uh '■= {{x, f{x))\f{x) > h}, 
h G Z>o and keeps track of how the sets Uh evolve as h decreases. Long paths in Tf represent 
features of the coverage function that persist through many values of h. 

Specifically, for each h G Z>o, let Ch denote the set of connected components of the upper- 
excursion set Uh- We define the rooted tree Tj = (V, E) as follows 

• Vertices in V correspond to the connected components in the collection {Ch}hei^o 

• {i,j) E E provided their corresponding connected components q E C/^. and Cj G Chj with 
hi < hj satisfy hi = hj — 1 and cj C q. 

Note that the root of Tf corresponds to the single connected component in Co- The tree Tf is very 
similar to a contour tree [H §4.1], which is built using level sets of a function, and a join tree [3]. 
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Indeed, suppose we ignore every vertex that is adjacent to only one vertex with greater height. 
Then, the remaining vertices of Tf correspond to (equivalence classes of) local extrema of /. Each 
local maximum of / yields the birth of a new connected component as we sweep down through 
h £ Z>o while a local minimum of / merges connected components. Since we do not require / 
to have distinct critical values (as is frequently assumed), the vertices in Tf can have arbitrary 
degrees, as is depicted in Figure [Tp. 

In the sequel, we will use the following equivalent characterization that can be found in ^ §2.3]. 
Given a coverage function / : M — > Z>o with /(a) = f{b) = and f{x) > for x G {a,b), we 
form an integer- valued sequence xq, . . . , X2n that records the changes in height of / on the interval 
[a, b]. The sequence xq, . . . , X2n consists of the y values that / travels through from xq := f{a) = 
to X2n '■= fib) = and satisfies 

XQ = X2n = 0, 

a:j > for < i < 2n, 
\xi — Xi-i\ = 1 for 1 < i < 2n. 

Such a sequence is called a lattice path excursion away from 0. Next, we define an equivalence 
relation on the set {0, 1, . . . , 2n} by setting 

i = i <;=^ Xj = x-i = min Xh- 

^ i<k<j 

The equivalence classes under this relation are in 1 : 1 correspondence with the connected compo- 
nents in the upper-excursion sets of f\yabY equivalence class is {0,2n}, and if {^i, . . . ,ip} is 
an equivalence class with Q < ii < i2 < ■ ■ ■ < ip then Xj^-i = Xi^ — 1, whereas Xi^-i = Xi^ + 1 for 
2 < q < p. Conversely, any index i with Xi-i = Xj — 1 is the minimal element of an equivalence 
class. We use the minimal element of each equivalence class as its representative. Thus, we can 
view the vertices of Tji^ as the set {0} U = Xi — 1}. Two indices ii < 12 are adjacent in 

S\[a b] P'^o'^i'^sd = Xi-^ + 1 and Xk > Xi^ for ii < k < 12- Figure jlj gives an example of a coverage 
function together with its lattice path excursion (0, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 3, 2, 1, 0) and rooted 
tree. The minimal elements of each equivalence class in Figure [Tp are depicted with red squares. 




Figure 1: A coverage function (A) with its lattice path excursion (B) and rooted tree (C). 
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3 Planar Poisson processes from sequencing experiments 



In order to model random coverage along the genome, we use a Poisson process to give random 
starting locations to the fragments. Specifically, suppose that we have a stationary Poisson point 
process on M with intensity p. At each point of the Poisson point process we lay down an interval 
that has that point as its left end-point. The lengths of the successive intervals are independent 
and identically distributed with common distribution p. We will use the notation X for a coverage 
function built from this process and Xt for the height at a point t. 

Let ti, • ■ ■ be the left-end points and li,l2,- " be the corresponding lengths of intervals. The 
interval given by (tj, k) will cover a nucleotide to provided ti < to and U + k > to. We can view this 
pictorially by plotting points {{tj,lj)} in the plane. Then — the number of intervals covering 
to — is the number of points in the triangular region below. We now recall the definition of a two- 



dimensional Poisson process and refer the reader to [10", §6.13] or |4i §2.4] for the details. Suppose 
r is a locally finite measure on the Borel cr-algebra =^(M^). A random countable subset 11 of 
is called a non-homogeneous Poisson process with mean measure T if, for all Borel subsets A, the 
random variables N{A) := i^{A n IT) satisfy: 

1. A^(^) has the Poisson distribution with parameter r(^), and 

2. K Ai, - ■ ■ , A], are disjoint Borel subsets of M^, then A^(yli), • • • , N{A).) are independent ran- 
dom variables. 

The following theorem is a consequence of |14|, Proposition 12.3]. 

Theorem 3.0.1. The collection {{ti, /j)} of points obtained as described above is a non-homogeneous 
Poisson process with mean measure pm® p. Here m is Lebesgue measure on M. 

We compute the expected value ¥,[Xt] = pm® /u(wedge) : 




(t, Z)-plane 



Figure 2: A two dimensional view of a sequencing experiment. 






— oo 



p[{t — n, co))du 




■oo 



p{{s, oo))ds. 
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3.1 Fragment lengths have the exponential distribution 

We treat the simplest case first, namely the case where the distribution /i of fragment lengths is 
exponential with rate A. Then, we have oo)) = F{1 > s} = e"'^'^, and 

E{Xt) = p r e-^'ds = ^. 

Jo ^ 

Claim 1. The process X is a stationary, time-homogeneous Markov process. 

Proof. It is clear that X is stationary because of the manner in which it is constructed from a Poisson 
process on M? that has a distribution which is invariant under translations in the t direction; that 
is, the random set {(tj, /j)} has the same distribution as {(tj + 1, /j)} for any fixed t € M. Since fj, is 
exponential, it is memoryless, meaning for any interval length / with an exponential distribution 

F{1 >a + b\l> a} = F{1 > b}. 

This means that probability that an interval covers t2 knowing that it covers ti is the same as the 
probability that an interval starting at ti covers t2. Thus, the probability that Xt^ = k given Xt 
for t < ti only depends on the value of Xt^^. Indeed, in terms of time, F{Xt2 = k\Xt-^ = k'} depends 
only on ^2 — ^1- D 

More specifically, X is a birth-and-death process with birth rate l3{k) = p in all states k and 
death rate 6{k) = kX in state A: > 1. Note that as the exponential distribution is the only 
distribution with the memoryless property, we lose the Markov property when n is not exponential. 

To build the tree of g we are interested in the jumps of the coverage function f{t) = Xf. We 
hence consider the jump chain of X — a discrete-time Markov chain with transition matrix 



P{i,j) 



1, if i = and j = 1, 

— r^-, if i > 1 and ?' = i + 1, 
if i > 1 and i = i — 1, 
0, otherwise. 



Suppose now we have a lattice path excursion starting at 0. Given a vertex v of the associated tree 
at height k, we are interested in the number of offspring (at height /c + 1) of this vertex. Suppose 
io is the minimal equivalence class representative for vertex and suppose [io] = {io,ii, ■ ■ ■ ,«n} 
with io < ii < • • • < in- Then, we have Xi^ = fc for < r < n, Xj^+i = A: + lforO<r<n — 1, 
^in+i = ^ — 1, and > fc for io < t < in with t / some ir. From the Markov property, for 
< i < n, F{xi^+i = k + l\xi. =k} = and ¥{xi^+i = k- l\xi^ = k} = The resulting 

tree is a Galton- Watson tree with generation-dependent offspring distributions (see [H |9l [121 US] 
for more on Galton- Watson trees). Indeed, we have 

P{a vertex at height k has n offspring} ' 



p + \k J p + Xk 

which is the probability of n failures before the first success in a sequence of independent Bernoulli 
trials where the probability of success equals p^\f. . 
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3.2 Fragment lengths have a general distribution 

Suppose that we have a general distribution ^ for the fragment lengths. We observe X at some fixed 
"time" - which might as well be because of stationarity, and ask for the conditional probability 
given Xq that the next jump of X will be upwards. We know from the above that if ^ is exponential 
with rate A, then conditional on Xq = k this is p/{p + kX). 

Let T denote the time until the next segment comes along. This random variable has an 
exponential distribution with rate p and is independent of Xq jH §2.1]. If we condition on Xq = k, 
the two-dimensional Poisson point process must have k points in the region 

A := {{t,l) : -oo < t < 0, -t < / < oo}. 




Conditionally, these k points in A have the same distribution as k points chosen at random in 
A according to the probability measure 

P_rnm4B) 
pm p{A) 

However, in order that the next jump after is upwards, the two-dimensional Poisson point process 
must have no points in the orange region 

{{t, I) : -oo < t < 0, -t < I < T - t} 

as these segments end before time T. This leaves the k points lying in the blue region 

Bt := {{t, I) : -oo <t<0,T-t<l< oo}, 

which occurs with probability f j" ) . Thus, conditional on Xq = k, the probability 

that the next jump will be upwards is 



V/o°°/i((^^,c»))'^?^/ 



pe 



-P' dt. 
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Write p{k) for this quantity. A reasonable approximation to the jump skeleton Z of X is to take it 
be a discrete-time Markov chain on the nonnegative integers with transition probabilities 



P{i,j) 



1, if z = and j = 1, 

p{i), if z > 1 and j = i + 

1 — p{i), ii i > 1 and j = i — 1, 

0, otherwise. 



The resulting tree is then a Galton- Watson tree with generation dependent offspring distributions, 
where 

P{a vertex at height k has n offspring} = p{k)'^{l — p{k)). 



Example 3.2.1. Suppose fj, is the point mass at L (that is, all segment lengths are L). Then 

n{{u,oo)) 



1, u< L 
0, u>L 



and 



This gives 



IJ,{{u, oo))du 



du = L — t, t < L 
0, t>L. 



p{k) 



-f 

Jo 



pe 



'P*dt 



= C w'^pe-P^^-^'^^Ldw 
Jo 



ee-'' I w^e^'"dw for k>l, 



where 6 := pL = lK[Xo]. We integrate by parts and find that p{k) = Oe ^q{k) where 



q{k) 



we 



w=l 



w=0 



k 



Jo 



e^'^dw 



e^_k 

e e 



q{k-l) for k>2, 



which yields the recursion 



= 1 - - 1), k>2, with p{l) = l-l + - 



Solving explicitly, we obtain 



p{k) = ki\Y, 



J=0 



T- + 



0k 



for k > 1. 
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4 Discussion 



Our observation that randomly sequenced fragments from a genome form a planar Poisson process 
in (position, length) coorindates has implications beyond the coverage function analysis performed 
in this paper. For example we have found that the visualization of sequencing data in this novel 
form is useful for quickly identifying instances of sequencing bias by eye, as it is easy to "see" 
deviations from the Poisson process. An example is shown in Figure |4] where fragments from an 
Illumina sequencing experiment are compared with an idealized simulation (where the fragments 
are placed uniformly at random). Specifically, paired-end reads from an RNA-Seq experiment 
conducted on a GAII sequencer were mapped back to the genome and fragments inferred from the 
read end locations. Bias in the sequencing is immediately visible, likely due to non-uniform PGR 
amplification [11] and other effects. We hope that others will find this approach to visualizing 
fragment data of use. 



(ij)-planc 




(t,/)-plane 




100 200 300 400 500 500 700 



100 200 300 400 500 600 700 



Figure 4: (A) Fragments from a sequencing experiment shown in the (t, I) plane. (B) The spatial 
Poisson process resulting from fragments with the same length distribution as (A) but with position 
sampled uniformly at random. 

The "shape" we have proposed for coverage functions was motivated by persistence ideas from 
topological data analysis (TDA). In the context of TDA, our setting is very simple (1-dimensional), 
however unlike what is typically done in TDA, we have provided a detailed probabilistic analysis 
that can be used to construct a null hypothesis for coverage-based test statistics. For example, we 
envision computing test statistics |16j based on the trees constructed from coverage functions and 
comparing those to the statistics expected from the Galton- Watson trees. It should be interesting 
to perform similar analyses with high-dimensional generalizations for which we believe many of our 
ideas can be translated. There are also biological applications, for example in the analysis of pooled 
experiments where fragments may be sequenced from different genomes simultaneously. 

Indeed, we believe that the study of sequence coverage functions that we have initiated may 
be of use in the analysis of many sequence census methods. The number of proposed protocols 
has exploded in the past two years, as a result of dramatic drops in the price of sequencing. For 
example, in January 2010, the company Illumina announced a new sequencer, the HiSeq 2000, 
that they claim "changes the trajectory of sequencing" and can be used to sequence 25Gb per 
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day. Although technologies such as the HiSeq 2000 were motivated by human genome sequencing 
a surprising development has been the fact that the majority of sequencing is in fact being used 
for sequence census experiments pO] . The vast amounts of sequence being produced in the context 
of complex sequencing protocols, means that a detailed probabilistic understanding of random 
sequencing is likely to become increasingly important in the coming years. 
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